Two-Armed Bandit Problem, Data Processing, and Parallel Version of the Mirror Descent Algorithm
Authors
Abstract
We consider the minimax setup for the two-armed bandit problem as applied to data processing when two alternative processing methods with different, a priori unknown efficiencies are available. One should determine the more effective method and ensure its predominant application. To this end we use the mirror descent algorithm (MDA). It is well known that the corresponding minimax risk has the order N^(1/2), where N is the number of processed data items. Using Monte Carlo simulations, we significantly improve the theoretical estimate of the factor in this bound. We then propose a parallel version of the MDA, which processes data in packets over a number of stages. With the parallel version, the total processing time depends mostly on the number of packets rather than on the total amount of data. Quite unexpectedly, the parallel version behaves unlike the ordinary one even when the number of packets is large. Moreover, the parallel version considerably improves control performance, since it yields a significantly smaller value of the minimax risk. We explain this result by considering another parallel modification of the MDA whose behavior is close to that of the ordinary version. Our estimates are based on invariant descriptions of the algorithms; all estimates are obtained by Monte Carlo simulations. It is worth noting that the parallel version performs well only for methods with close efficiencies. If the efficiencies differ significantly, one should use a combined algorithm that applies the ordinary version on an initial, sufficiently short control horizon and then switches to the parallel version of the MDA.

Keywords: two-armed bandit problem, control in a random environment, minimax approach, robust control, mirror descent algorithm, parallel processing.
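The abstract does not give the authors' exact update rule, but the general shape of a mirror descent bandit strategy with an entropic regularizer (EXP3-style multiplicative weights) can be sketched as follows. Setting batch_size=1 gives the ordinary per-step algorithm; batch_size>1 mimics packet processing: the sampling distribution is frozen within each packet and a single mirror step is applied per packet. All names, the step-size default, and the batching scheme are illustrative assumptions, not the paper's precise algorithm.

```python
import math
import random

def exp3_mda(success_probs, n_batches, batch_size=1, gamma=None, seed=0):
    """Sketch of a mirror descent (entropic-regularizer, EXP3-style) strategy
    for a Bernoulli bandit. Hypothetical names; batch_size > 1 imitates the
    packet (parallel) processing described in the abstract."""
    rng = random.Random(seed)
    k = len(success_probs)
    n_steps = n_batches * batch_size
    if gamma is None:
        # classical EXP3-style tuning, of order sqrt(K log K / N)
        gamma = min(1.0, math.sqrt(k * math.log(k) / n_steps))
    log_w = [0.0] * k          # cumulative importance-weighted reward estimates
    total_reward = 0.0
    p = [1.0 / k] * k
    for _ in range(n_batches):
        mx = max(log_w)        # shift before exponentiating, for stability
        w = [math.exp(lw - mx) for lw in log_w]
        s = sum(w)
        p = [(1.0 - gamma) * wi / s + gamma / k for wi in w]
        est = [0.0] * k
        for _ in range(batch_size):   # pulls inside a packet are independent,
            arm = rng.choices(range(k), weights=p)[0]  # so they can run in parallel
            r = 1.0 if rng.random() < success_probs[arm] else 0.0
            total_reward += r
            est[arm] += r / p[arm]    # importance-weighted reward estimate
        for i in range(k):            # one mirror (multiplicative) step per packet
            log_w[i] += gamma * est[i] / (k * batch_size)
    return total_reward, p
```

Run with close arm efficiencies (e.g. success_probs=[0.55, 0.45]) to reproduce the regime where, per the abstract, the packet version is claimed to do well; the final distribution p shows how strongly the strategy has committed to the better arm.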
Similar resources
Gap-free Bounds for Stochastic Multi-Armed Bandit
We consider the stochastic multi-armed bandit problem with unknown horizon. We present a randomized decision strategy which is based on updating a probability distribution through a stochastic mirror descent type algorithm. We consider separately two assumptions: nonnegative losses or arbitrary losses with an exponential moment condition. We prove optimal (up to logarithmic factors) gap-free bo...
Corralling a Band of Bandit Algorithms
We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the mas...
Online Learning with Partial Feedback
In previous lectures we talked about the general framework of online convex optimization and derived an algorithm for prediction with expert advice from this general framework. To apply the online algorithm, we need to know the gradient of the loss function at the end of each round. In the prediction of expert advice setting, this boils down to knowing the cost of each individual expert. In thi...
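The full-information setting this lecture excerpt contrasts with bandit feedback can be illustrated by a minimal Hedge (exponential weights) sketch: because every expert's loss is observed after each round, no importance weighting is needed, unlike the bandit algorithms above. The function name and learning rate eta are illustrative assumptions.

```python
import math

def hedge(loss_rounds, eta=0.5):
    """Minimal Hedge / exponential-weights sketch for prediction with expert
    advice under full information. loss_rounds[t][i] is expert i's loss in
    round t; returns the learner's expected cumulative loss and final weights."""
    k = len(loss_rounds[0])
    log_w = [0.0] * k
    cum_loss = 0.0
    for losses in loss_rounds:
        mx = max(log_w)                       # stabilize the exponentials
        w = [math.exp(lw - mx) for lw in log_w]
        s = sum(w)
        p = [wi / s for wi in w]
        cum_loss += sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            log_w[i] -= eta * li              # full-information multiplicative update
    return cum_loss, p
```

With one expert always incurring loss 0 and another always loss 1, the weight mass concentrates on the better expert within a few rounds.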
Asynchronous Parallel Empirical Variance Guided Algorithms for the Thresholding Bandit Problem
This paper considers the multi-armed thresholding bandit problem – identifying all arms whose expected rewards are above a predefined threshold via as few pulls (or rounds) as possible – proposed by Locatelli et al. (2016) recently. Although the proposed algorithm in Locatelli et al. (2016) achieves the optimal round complexity in a certain sense, there still remain unsolved issues. This paper ...
More Adaptive Algorithms for Adversarial Bandits
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order...
Publication date: 2017